Abstract  A  data  mining  based  procedure  for  automated  reverse  engineering  has  been  developed.  The  data  mining 
algorithm  for  reverse  engineering  uses  a  genetic  program  (GP)  as  a  data  mining  function.  A  genetic 
program  is  an  algorithm  based  on  the  theory  of  evolution  that  automatically  evolves  populations  of 
computer  programs  or  mathematical  expressions,  eventually  selecting  one  that  is  optimal  in  the  sense  it 
maximizes  a  measure  of  effectiveness,  referred  to  as  a  fitness  function.  The  system  to  be  reverse  engineered 
is  typically  a  sensor.  Design  documents  for  the  sensor  are  not  available  and  conditions  prevent  the  sensor 
from  being  taken  apart.  The  sensor  is  used  to  create  a  database  of  input  signals  and  output  measurements. 
Rules  about  the  likely  design  properties  of  the  sensor  are  collected  from  experts.  The  rules  are  used  to 
create  a  fitness  function  for  the  genetic  program.  Genetic  program  based  data  mining  is  then  conducted. 
This  procedure  incorporates  not  only  the  experts’  rules  into  the  fitness  function,  but  also  the  information  in 
the  database.  The  information  extracted  through  this  process  is  the  internal  design  specifications  of  the 
sensor.  Significant  mathematical  formalism  and  experimental  results  related  to  GP  based  data  mining  for 
reverse  engineering  will  be  provided. 


1 INTRODUCTION

An engineer must design a signal that will yield a particular type of output from a sensor device (SD). The engineer does not have design specifications for the sensor system and the machine may not be disassembled or invasively examined. The engineer might attempt to find the correct signal through trial and error, but this would be very time consuming and access to experimental resources is very expensive. To deal with this problem a genetic program (GP) based data mining (DM) procedure has been invented (Smith 2005).

A genetic program is an algorithm based on the theory of evolution that automatically evolves populations of computer programs or mathematical expressions, eventually selecting one that is optimal in the sense it maximizes a measure of effectiveness, referred to as a fitness function (Koza 1999; Smith 2003a, 2003b, 2004). The system to be reverse engineered is typically a sensor. The sensor is used to create a database of input signals and output measurements. Rules about the likely design properties of the sensor are collected from experts. The rules are used to create a fitness function for the genetic program. Genetic program based data mining is then conducted (Bigus 1996; Smith 2003a, 2003b, 2004). This procedure incorporates not only experts' rules into the fitness function, but also the information in the database. The information extracted through this process is the internal design specifications of the sensor. The design properties extracted through this process can be used to design a signal that will produce a desired output (Smith 2005). Determination of such signals can be essential to ultimate determination of control rules for automatic multiplatform coordination (Smith 2003a, 2003b, 2004).

GPs require a terminal set and function set as inputs. The terminals are the actual variables of the problem. These can include a variable like x used as a symbol in building a polynomial and also real constants. The function set consists of a list of functions that can operate on the variables. When a GP was used as a DM function in the past to automatically create fuzzy decision trees, the terminals consisted of fuzzy root concepts and the functions consisted of fuzzy logical connectives and fuzzy modifiers (Smith 2003a, 2003b, 2004).
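
To make the roles of the two sets concrete, the following is a minimal sketch, not the system described in this paper, of a terminal set and a function set for the polynomial case mentioned above. The names TERMINALS, FUNCTIONS, and evaluate_prefix are hypothetical and chosen only for illustration.

```python
import operator

# Hypothetical terminal set: the variable x and a few real constants.
TERMINALS = {"x", "1.0", "2.0"}

# Hypothetical function set: name -> (arity, implementation).
FUNCTIONS = {
    "ADD": (2, operator.add),
    "MUL": (2, operator.mul),
}


def evaluate_prefix(tokens, x):
    """Evaluate a prefix-notation expression built from the two sets above."""

    def walk(pos):
        tok = tokens[pos]
        if tok in FUNCTIONS:
            arity, fn = FUNCTIONS[tok]
            pos += 1
            args = []
            for _ in range(arity):
                value, pos = walk(pos)
                args.append(value)
            return fn(*args), pos
        if tok not in TERMINALS:
            raise ValueError(f"unknown symbol: {tok}")
        # A terminal is either the variable x or a real constant.
        return (x if tok == "x" else float(tok)), pos + 1

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


# The polynomial x*x + 2 written in prefix notation.
poly = "ADD MUL x x 2.0".split()
result = evaluate_prefix(poly, 3.0)
```

A GP would generate and recombine many such token strings; only the evaluation step is sketched here.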

When the GP is used as a data mining function, a database of input and output information is required. When the GP is used as a data mining function for evolving digital logic (DL), the database contains inputs to the DL as well as measured outputs. The experts' opinions are manifested in the selection of the input and associated output to be included in the database. For the DL case, an additional form of input consisting of "rules" about DL construction is included.

Section  2  discusses  data  mining  and  the  use  of  a 
genetic  program  as  a  data  mining  function.  Section 
3  examines  one  of  the  digital  logic  designs  to  be 
reverse  engineered  using  genetic  program  based  data 
mining.  Section  4  explains  the  genetic  program’s 
terminal  set,  function  set,  and  fitness  function. 
Section  4  also  gives  detailed  formulations  of  the  rule 
fitness,  fitness  score,  input-output  fitness,  and 
overall  fitness.  Section  5  provides  experimental 
results  with  detailed  descriptions  of  the  evolutionary 
properties.  Finally,  section  6  provides  conclusions. 


2  GP  BASED  DATA  MINING 

Data mining is the efficient extraction of valuable non-obvious information embedded in a large quantity of data (Bigus 1996). Data mining consists of three steps: the construction of a database that represents truth; the calling of the data mining function to extract the valuable information, e.g., a clustering algorithm, neural net, genetic algorithm, genetic program, etc.; and finally determining the value of the information extracted in the second step, which generally involves visualization.

When used for reverse engineering, the GP typically data mines a database to determine a graph-theoretic structure, e.g., a system's DL diagram or an algorithm's flow chart or decision tree (Smith 2003a, 2003b, 2004). The GP mines the information from a database consisting of input and output values, e.g., a set of inputs to a sensor and its measured outputs. GP based data mining will be applied to the construction of the DLs described in sections 3 and 5.

To use the genetic program it is necessary to construct terminal and function sets relevant to the problem. Before the specific terminal and function sets for the reverse engineering problems are described, a more detailed description of one of the digital logic examples to be considered will be given in section 3.


3  DIGITAL  LOGIC  TO  BE 
REVERSE  ENGINEERED 

The first DL design to be reverse engineered is given in prefix notation in (1) and is depicted diagrammatically in Figure 1,

OR2 OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG123
OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG3 H2 DIFF SUM_SIG3 SUM_SIG123.    (1)

The notation is described in (Smith 2005) and summarized in this section. This DL is not known to the GP. The GP only has access to a database of input signals to the DL and measured output, as well as a database of rules provided by experts for building the DL.


[Figure 1: diagram of the DL; the inputs are labeled SOURCE 2, SOURCE 1, and SOURCE 3.]


The DL consists of three input channels each with a sensor attached. The sensors receive signals from sources one, two and three. Only measurements from the central source in Figure 1 are of interest. Due to the geometry of the sources and properties of the sensors only sensor two can receive emissions from the central source that are significant. Unfortunately, sensor two's measurement may be corrupted by emissions from the other two sources. The digital logic is constructed so that if there were significant corruption of sensor two's measurements, then the final OR-gate returns unity, so the measurements can be ignored.

There are a number of DL elements that are used repeatedly. The DL components and signals will ultimately become elements of the GP's terminal and function sets. The sensors will receive an analog signal and convert it to a digital form, i.e., they will map real-valued input to the set of integers. A sampling window of size $N$ is used, i.e., the signal is sampled every $\Delta t$ seconds for a total of $N$ samples in that window. The sample is indicated by the vector $\vec{s}_j$ in (2) with sampling beginning at time $t_0$. The $j$-subscript implies the signal originated in the $j$th source, where $j = 1, 2, 3$,

$$\vec{s}_j = \left[\, s_j(t_0),\ s_j(t_0 + \Delta t),\ \ldots,\ s_j(t_0 + (N-1)\,\Delta t) \,\right]. \tag{2}$$

The DL function, SUM, given explicitly in (3), represents the logarithmic sum of the absolute value of the time components of the digitized input that has been received for a single window of length $N$,

$$\mathrm{SUM}(\vec{s}_j) = \log\left[ \sum_{k=1}^{N} \left| s_j(t_0 + (k-1)\,\Delta t) \right| \right]. \tag{3}$$


The elements labeled $H_i$, for $i = 1, 2, 3$, are Heaviside step functions as given in (4). If the input is greater than or equal to a threshold, $T_i$, for $i = 1, 2, 3$, then a value of unity is transmitted; otherwise a zero is transmitted,

$$H_i(s) = \begin{cases} 1, & \text{if } s \ge T_i \\ 0, & \text{if } s < T_i. \end{cases} \tag{4}$$

The DL function, MAX, given in (5), returns the common logarithm of the maximum absolute value of the time components of the input signal for a single window of length $N$. The element labeled DIFF takes the difference between the inputs to its first and second arguments as indicated in (6).

$$\mathrm{MAX}(\vec{s}_j) = \log\left[ \max_{1 \le k \le N} \left| s_j(t_0 + (k-1)\,\Delta t) \right| \right] \tag{5}$$

$$\mathrm{DIFF}(I_1, I_2) = I_1 - I_2 \tag{6}$$

The DL function, OR3DELAY, takes only Boolean inputs, i.e., it expects zero or one as an input. It waits until it has three consecutive inputs from three consecutive time windows, hence the "3" in its name. Once it receives three consecutive inputs, it yields as an output the maximum of its inputs. Also not depicted, but used in the GP's function set, is AND3DELAY, which takes three inputs of zero or one corresponding to three consecutive time windows and yields as output the minimum of its inputs. Finally, the symbols labeled AND3, OR3, AND2, and OR2 are the conventional logical connectives AND and OR, with the numerical designation indicating the number of inputs expected, e.g., AND3 expects three Boolean inputs.

The signals are additive: at any given time sensor two may record a superposition of the three sources' transmissions, which is represented by $s_1(t) + s_2(t) + s_3(t)$. If the three sensors' signals are of sufficient magnitude then this is characteristic of corruption and the final OR in Figure 1 returns unity.


4 GP TERMINAL SET, FUNCTION SET AND FITNESS

This section describes the GP's terminal set, function set, and the fitness functions. The description is given in terms of DL elements and properties, but the genetic program based reverse engineering technique is very general and can be applied to any system that can be described in a graph-theoretic language, e.g., decision processes described in terms of decision trees (Smith 2003a, 2003b, 2004).

The terminal set consists of the following elements:

$$T = \{\mathrm{SUM\_SIG123},\ \mathrm{MAX\_SIG123},\ \mathrm{SUM\_SIG2},\ \mathrm{MAX\_SIG2},\ \mathrm{SUM\_SIG3},\ \mathrm{MAX\_SIG3}\}, \tag{7}$$

where

$$\mathrm{SUM\_SIG123} = \mathrm{SUM}(\vec{s}_1 + \vec{s}_2 + \vec{s}_3), \tag{8}$$

$$\mathrm{MAX\_SIG123} = \mathrm{MAX}(\vec{s}_1 + \vec{s}_2 + \vec{s}_3), \tag{9}$$

$$\mathrm{SUM\_SIG2} = \mathrm{SUM}(\vec{s}_2), \tag{10}$$

$$\mathrm{MAX\_SIG2} = \mathrm{MAX}(\vec{s}_2), \tag{11}$$

$$\mathrm{SUM\_SIG3} = \mathrm{SUM}(\vec{s}_3), \tag{12}$$

$$\mathrm{MAX\_SIG3} = \mathrm{MAX}(\vec{s}_3). \tag{13}$$

All sensor measurements begin at time $t_0$.
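
The terminal values can be computed directly from sampled windows. The sketch below is one possible reading of the SUM, MAX, Heaviside, and DIFF elements, assuming each window is a plain list of samples and assuming a base-10 logarithm for SUM as well as MAX (the text only states the base for MAX). The function names and sample values are hypothetical.

```python
import math


def sum_feature(window):
    """SUM: logarithm of the summed absolute sample values (base 10 assumed)."""
    return math.log10(sum(abs(s) for s in window))


def max_feature(window):
    """MAX: common logarithm of the largest absolute sample value."""
    return math.log10(max(abs(s) for s in window))


def heaviside(value, threshold):
    """H_i: unity when the input reaches the threshold T_i, zero otherwise."""
    return 1 if value >= threshold else 0


def diff(first, second):
    """DIFF: first argument minus second argument."""
    return first - second


def terminals(s1, s2, s3):
    """Terminal values for one time window (an all-zero window is not handled)."""
    s123 = [a + b + c for a, b, c in zip(s1, s2, s3)]  # additive superposition
    return {
        "SUM_SIG123": sum_feature(s123),
        "MAX_SIG123": max_feature(s123),
        "SUM_SIG2": sum_feature(s2),
        "MAX_SIG2": max_feature(s2),
        "SUM_SIG3": sum_feature(s3),
        "MAX_SIG3": max_feature(s3),
    }


# Hypothetical windows of N = 4 samples per source.
values = terminals([1, 1, 1, 1], [10, -20, 30, 40], [1, 2, 3, 4])
```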

The function set consists of the following elements:

$$F = \{\mathrm{AND3},\ \mathrm{OR3},\ \mathrm{AND2},\ \mathrm{OR2},\ \mathrm{AND3DELAY},\ \mathrm{OR3DELAY},\ H_1,\ H_2,\ H_3,\ \mathrm{DIFF}\}. \tag{14}$$

The  function  AND3DELAY  is  not  used  for  the  DL 
under  consideration.  By  including  it,  the  GP’s 
ability  to  discriminate  against  extraneous  functions 
is  emphasized. 
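
A chromosome in prefix notation can be evaluated once each function has an arity. The sketch below is an illustration under stated assumptions, not the paper's implementation: the delay elements are treated as single-argument nodes that evaluate their subtree over three consecutive windows, the thresholds and terminal values are invented, and the chromosome string is one reading of (1). The names ARITY, parse, evaluate, and run are hypothetical.

```python
# Assumed arities; the delay elements take one input stream seen over 3 windows.
ARITY = {
    "OR2": 2, "AND2": 2, "OR3": 3, "AND3": 3,
    "OR3DELAY": 1, "AND3DELAY": 1,
    "H1": 1, "H2": 1, "H3": 1, "DIFF": 2,
}


def parse(tokens, pos=0):
    """Turn a prefix token list into a nested (name, children) tree."""
    name = tokens[pos]
    pos += 1
    children = []
    for _ in range(ARITY.get(name, 0)):  # terminals have arity 0
        child, pos = parse(tokens, pos)
        children.append(child)
    return (name, children), pos


def evaluate(node, windows, thresholds, w):
    """Value of a node for time window index w (0, 1, or 2)."""
    name, kids = node
    if not kids:
        return windows[w][name]  # terminal value for this window
    if name in ("OR3DELAY", "AND3DELAY"):
        vals = [evaluate(kids[0], windows, thresholds, i) for i in range(3)]
        return max(vals) if name == "OR3DELAY" else min(vals)
    args = [evaluate(k, windows, thresholds, w) for k in kids]
    if name in ("OR2", "OR3"):
        return max(args)
    if name in ("AND2", "AND3"):
        return min(args)
    if name == "DIFF":
        return args[0] - args[1]
    return 1 if args[0] >= thresholds[name] else 0  # H1, H2, H3


def run(chromosome, windows, thresholds):
    tree, end = parse(chromosome.split())
    if end != len(chromosome.split()):
        raise ValueError("chromosome has trailing tokens")
    return evaluate(tree, windows, thresholds, len(windows) - 1)


DL1 = ("OR2 OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG123 "
       "OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG3 H2 DIFF SUM_SIG3 SUM_SIG123")

# Hypothetical thresholds and per-window terminal values.
THRESHOLDS = {"H1": 1.0, "H2": -0.5, "H3": 1.0}
LOUD = {"MAX_SIG123": 2.0, "SUM_SIG123": 1.8, "SUM_SIG2": 1.5, "SUM_SIG3": 0.2}
QUIET = {"MAX_SIG123": 0.5, "SUM_SIG123": 0.6, "SUM_SIG2": 0.4, "SUM_SIG3": 0.1}

corrupted = run(DL1, [QUIET, LOUD, QUIET], THRESHOLDS)
clean = run(DL1, [QUIET, QUIET, QUIET], THRESHOLDS)
```

With these invented values a single loud window is enough for OR3DELAY to report unity, while three quiet windows yield zero.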

The  DL  design  to  be  evolved  by  the  GP  is  given 
in  (1).  The  GP’s  ability  to  do  this  will  be 
determined  largely  by  the  fitness  function  and  the 
underlying  databases  to  be  discussed. 

As with all GPs there must be a fitness function for evaluation of the evolving population of chromosomes. The fitness function, referred to as the overall fitness (OF) and denoted as $f_{OF}$, is actually the sum of two other fitness functions. These functions are the rule fitness (RF) and the input-output fitness (IOF), denoted as $f_{RF}$ and $f_{IOF}$, respectively. The rule fitness is given in (15), where the indicator function $I_i$ is unity if the $i$th rule is satisfied and zero otherwise, and $v_i$ is the value of the $i$th rule. Table 1 provides a small subset of the 12 rules used.

$$f_{RF} = \sum_{i} I_i \, v_i \tag{15}$$

Let $DL_j$ denote the $j$th element of the evolving population of chromosomes within the GP for $j = 1, 2, \ldots, m_{ps}$, where $m_{ps}$ is the population size, i.e., the number of chromosomes. Let each $DL_j$ consist of an OR2 or AND2 that connects two subgraphs, denoted as $DL_{j\_left}$ and $DL_{j\_right}$. Let $l(DL_{j\_g})$ be the length, i.e., the number of nodes in $DL_{j\_g}$, for $g \in \{left, right\}$. If $l(DL_{j\_g})$ is greater than or equal to 20 then the parsimony pressure, $a_p \cdot l(DL_{j\_g})$, is subtracted from the rule fitness followed by division by 100, ultimately yielding the rule score, denoted as $g_{RS}$. This subtraction is done if either $l(DL_{j\_left})$ or $l(DL_{j\_right})$ is at least 20. The quantity $a_p$ is referred to as the parsimony coefficient (Smith 2005). The rule score is expressed compactly as

$$g_{RS}(DL_j) = \frac{1}{100}\Big\{ f_{RF}(DL_{j\_left}) - \chi\big[\,l(DL_{j\_left}) - 20\,\big]\, a_p\, l(DL_{j\_left}) + f_{RF}(DL_{j\_right}) - \chi\big[\,l(DL_{j\_right}) - 20\,\big]\, a_p\, l(DL_{j\_right}) \Big\}, \tag{16}$$

where the Heaviside step function $\chi$ takes the value unity for non-negative arguments and is zero otherwise. If the rule score reaches or exceeds the rule threshold, denoted as $k_{rt}$, then and only then is the input-output fitness evaluated. By forcing the rule score to reach a threshold before the input-output fitness is evaluated, a great deal of computational complexity is avoided.
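
The rule score described above can be sketched in a few lines. This is a hedged illustration: the two rules are the ones listed in Table 1, the parsimony coefficient value 0.1 is invented for the example, and the names rule_fitness, rule_score, and RULES are hypothetical.

```python
def rule_fitness(tokens, rules):
    """Sum the values v_i of every rule whose predicate is satisfied."""
    return sum(value for satisfied, value in rules if satisfied(tokens))


def rule_score(left, right, rules, parsimony_coeff, max_len=20):
    """Rule fitness of both subgraphs, less parsimony pressure, divided by 100."""
    total = 0.0
    for sub in (left, right):
        total += rule_fitness(sub, rules)
        if len(sub) >= max_len:  # step function is unity for length >= 20
            total -= parsimony_coeff * len(sub)
    return total / 100.0


# The two rules shown in Table 1, each worth 5.
RULES = [
    (lambda t: "OR3DELAY" in t or "AND3DELAY" in t, 5),
    (lambda t: "AND3" in t or "OR3" in t, 5),
]

left = "OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG123".split()
right = "OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG3 H2 DIFF SUM_SIG3 SUM_SIG123".split()

score = rule_score(left, right, RULES, parsimony_coeff=0.1)

# A bloated subgraph of 21 nodes that satisfies no rule is penalized.
bloated = ["H1"] * 20 + ["SUM_SIG2"]
penalized = rule_score(bloated, right, RULES, parsimony_coeff=0.1)
```

Because membership is tested on whole tokens, AND3DELAY does not accidentally satisfy the AND3 rule.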

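Let $DL^T$ denote the true digital logic diagram that underlies the SD used to construct the input-output database. For the examples considered in this paper let there be three signals. The input-output database is assumed to have the following structure

$$\begin{bmatrix} \vec{s}_1^{\,1} & \vec{s}_2^{\,1} & \vec{s}_3^{\,1} & B_1 \\ \vec{s}_1^{\,2} & \vec{s}_2^{\,2} & \vec{s}_3^{\,2} & B_2 \\ \vdots & \vdots & \vdots & \vdots \\ \vec{s}_1^{\,m} & \vec{s}_2^{\,m} & \vec{s}_3^{\,m} & B_m \end{bmatrix} \tag{17}$$

where $\vec{s}_j^{\,k}$ is the three time window input from the $j$th source for the $k$th input; $B_k \in \{0, 1\}$ is the $k$th output from $DL^T$ for $k = 1, 2, \ldots, m$, i.e.,

$$B_k = DL^T\!\left(\vec{s}_1^{\,k}, \vec{s}_2^{\,k}, \vec{s}_3^{\,k}\right), \quad \text{for } k = 1, 2, \ldots, m. \tag{18}$$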


Table 1: Subset of the rule set for computational GP experiments

R1: If either OR3DELAY or AND3DELAY is present during rule fitness evaluation, add $v_1 = 5$.

R2: If AND3 or OR3 is present during fitness evaluation, add $v_2 = 5$.


The input-output fitness for the $j$th chromosome, $DL_j$, is defined as

$$f_{IOF}(j, M_{db}) = \frac{1}{1 + \sum_{k=1}^{m} \left| DL_j\!\left(\vec{s}_1^{\,k}, \vec{s}_2^{\,k}, \vec{s}_3^{\,k}\right) - B_k \right|}. \tag{19}$$

The overall fitness, $f_{OF}$, for the $j$th chromosome can be written as

$$f_{OF}(j, M_{db}) = g_{RS}(DL_j) + \chi\big[\,g_{RS}(DL_j) - k_{rt}\,\big]\; f_{IOF}(j, M_{db}). \tag{20}$$

It is important to recall that in actual implementation, the input-output fitness is only evaluated if the rule score is greater than or equal to the rule threshold. Selectively evaluating the input-output fitness greatly reduces the computational complexity and hence the run-time of the GP.
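
The gating of the input-output fitness can be sketched as follows. This assumes the input-output fitness is the reciprocal of one plus the number of mismatched database rows, which is one reading of (19); the candidate function, the scalar stand-ins for the signal vectors, and the threshold value are all hypothetical.

```python
def io_fitness(candidate, database):
    """Reciprocal of one plus the summed absolute output errors over all rows."""
    errors = sum(abs(candidate(s1, s2, s3) - b) for s1, s2, s3, b in database)
    return 1.0 / (1.0 + errors)


def overall_fitness(rule_score, rule_threshold, candidate, database):
    """Rule score plus the input-output fitness, gated by the rule threshold."""
    if rule_score < rule_threshold:
        return rule_score  # candidate is never run against the database
    return rule_score + io_fitness(candidate, database)


# Toy database rows (s1, s2, s3, B); scalars stand in for the signal vectors.
DATABASE = [(0.0, 1.0, 0.0, 1), (0.0, -1.0, 0.0, 0), (0.0, 2.0, 0.0, 0)]

calls = []


def candidate(s1, s2, s3):
    """Hypothetical evolved logic: fire whenever the second signal is positive."""
    calls.append((s1, s2, s3))
    return 1 if s2 > 0 else 0


gated = overall_fitness(0.10, 0.15, candidate, DATABASE)   # below threshold
calls_after_gated = len(calls)
scored = overall_fitness(0.20, 0.15, candidate, DATABASE)  # threshold reached
```

The early return is where the saving comes from: a chromosome that fails the rule threshold costs no database evaluations at all.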


5  DATA  MINING  RESULTS 

In this section two different DL schemes data mined by the GP are considered. The two examples presented here are representative of the many experiments that have been conducted to show the effectiveness of the GP based data mining procedure presented in this paper. The first is the DL represented in (1) and also in Figure 1. This DL will assume the value of $DL^T$ for the discussion below. Using various databases too large to reproduce here and different random number generator seeds, the GP was able to reverse engineer (1) in no more than 76 GP generations. The different numbers of generations and amounts of CPU time required reflect the effect of different input-output databases and also the random number generator seeds. One database may constrain the evolutionary process more than another, resulting in fitness values that over time push the population more rapidly toward $DL^T$. Also, since the initial population is generated randomly, and crossover, mutation, architecture altering steps (AAS) (Smith 2005) and symmetrical replication (SR) (Smith 2005) have random aspects, a change in the seed of the random number generator can also impact run-time.


To get a feel for the evolutionary process it is useful to examine some intermediate generations that lead to $DL^T$. For the case in which (1) is reverse engineered in 76 generations, the elite chromosomes found for different generations are provided in Table 2, with the 76th generation reproducing the correct chromosome given in (1).

The  chromosomes  entered  into  Table  2  reflect 
some  of  the  characteristics  observed  during  the 
evolutionary  process.  From  the  first  generation 
forward  the  GP  is  able  to  find  best  candidates  that 
have  an  OR2  at  the  end  of  the  chromosome.  The 
presence  of  two  DIFF  operators  in  the  first 
generation  is  also  promising.  The  best  chromosome 
for  the  first  generation  is  much  too  short  when 
compared  to  the  desired  result. 

Innovations are found in generation 25 in that both arguments of both DIFFs use MAX functions as well as the SIG123 structure. Even though it is expected that both arguments will ultimately use SUM functions, the use of a common function for both arguments may show evolution in the proper direction. Both DIFF operators are preceded by H2, which is what is found in (1). Even with these innovations the best chromosome of generation 25 is far from the correct result.

All generations after the 26th have elite chromosomes that have underlying graphs isomorphic to the final solution. The GP's effort from generation 27 through 76 involves finding a solution with the proper node labels. Various rows in the input-output (IO) database, i.e., (17), contribute to proper labeling, e.g., if a certain row in the database is deleted then it is likely the final GP solution would not have proper threshold labeling. An improper threshold value is undesirable from the standpoint of trying to reproduce the exact digital logic. If the goal is to produce an input signal that yields unity as an output, then even with the threshold value wrong, the desired output is obtained as long as the input signal has sufficient energy to take the uncertainty into account. In conclusion, the ultimate cost of information uncertainty in this case is a small amount of additional power.
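
The trade between a lost database row and extra signal power can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's analysis: it considers a single step function with an unknown threshold, a hypothetical list of candidate thresholds, and invented (input level, output) rows.

```python
def consistent_thresholds(rows, candidates):
    """Candidate thresholds whose step function reproduces every database row."""
    return [
        t for t in candidates
        if all((1 if level >= t else 0) == output for level, output in rows)
    ]


# Hypothetical rows of (input level, measured output); the true threshold is 2.0.
ROWS = [(0.5, 0), (1.5, 0), (2.5, 1), (3.5, 1)]
CANDIDATES = [1.0, 2.0, 3.0]

full = consistent_thresholds(ROWS, CANDIDATES)

# Losing the row (2.5, 1) leaves the threshold label ambiguous.
reduced_rows = [row for row in ROWS if row != (2.5, 1)]
reduced = consistent_thresholds(reduced_rows, CANDIDATES)

# A signal at or above the largest surviving candidate still yields unity.
safe_level = max(reduced)
```

In this toy case the full database pins the threshold to 2.0, while the reduced database also admits 3.0, so a designer who drives the input to 3.0 still obtains unity at the cost of extra power.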

The best chromosome of the 50th generation is far closer to (1). The MAX functions in the arguments of the DIFF have been replaced by SUM functions. The arguments of the DIFF operators are the ones for the final result and the output of both DIFFs is passed into H2 as found in (1). In this chromosome replacing H1 MAX_SIG2 with H3 SUM_SIG2 and H1 MAX_SIG3 with H3 SUM_SIG3 would yield the correct result. Finally, the desired result is found in generation 76.

For a second example consider the DL given below in (21) as $DL^T$, i.e., truth,

OR2 OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG2 H2 DIFF MAX_SIG2 MAX_SIG123
OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG3 H2 DIFF MAX_SIG3 MAX_SIG123.    (21)

The GP's evolutionary process for inverting (21) is summarized in Table 3.

This example is similar to (1); in fact, if in (1) the MAX operations are replaced by the SUM operation and SUM replaced by MAX, then (21) is obtained. Given that (1) and (21) only differ in labeling of the underlying graphs it is anticipated that the GP based evolutionary processes that yield (1) and (21) would be similar. This anticipation is borne out, but there are differences in the evolutionary processes. One significant difference is that the chromosome in (21) is evolved in a smaller number of generations than the one found in (1). There is nothing that is obvious about the rule set or input-output database used for both chromosomes that would favor one over the other. Experimentation seems to indicate the difference in the number of generations required is related to the seed of the random number generator.
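
The relabeling that relates the two examples is mechanical and can be checked in code. The sketch below assumes the token spellings used here for (1) and (21), with H1, H2, H3 read as the three step functions; the function name swap_sum_max is hypothetical.

```python
# Chromosomes (1) and (21) as token lists, under the reading assumed here.
EQ1 = ("OR2 OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG123 "
       "OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG3 H2 DIFF SUM_SIG3 SUM_SIG123").split()

EQ21 = ("OR2 OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG2 H2 DIFF MAX_SIG2 MAX_SIG123 "
        "OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG3 H2 DIFF MAX_SIG3 MAX_SIG123").split()


def swap_sum_max(tokens):
    """Exchange the SUM_ and MAX_ prefixes on every terminal; keep all else."""
    swapped = []
    for tok in tokens:
        if tok.startswith("SUM_"):
            swapped.append("MAX_" + tok[4:])
        elif tok.startswith("MAX_"):
            swapped.append("SUM_" + tok[4:])
        else:
            swapped.append(tok)
    return swapped


same_graph = swap_sum_max(EQ1) == EQ21
```

Only terminal labels change under the swap, so the token positions, and hence the underlying graph, are identical for both chromosomes.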

Just as with the example in Table 2, the best chromosome of the first generation has an OR2 at the end, but is otherwise too short and far removed from the correct answer. By the eighth generation the "H2 DIFF MAX_SIG2 MAX_SIG123" structure has emerged. The best chromosome of the 16th generation preserves the best features of previous generations and also makes use of an OR3DELAY, but it still has many defects. For all generations after the 26th generation the elite chromosome has two subgraphs that take a form that can be derived in closed form. The elite chromosome of the 30th generation has many correct labels as well as incorrect ones. It illustrates how evolution can fluctuate from generation to generation, producing individuals of higher fitness but departing significantly from the true DL in form. Finally, in generation 46 the GP converges, having produced the correct DL design.

As referenced above it is possible to derive closed form exact results for the set of digital logic diagrams, or set of DL maps, referred to as DLS, that exactly maximize the rule score. The graph underlying each DL diagram is isomorphic in the graph-theoretic sense to the graph underlying $DL^T$. Furthermore, for certain signal types it is possible to write down closed form exact results for the image sets under these DL maps. From the image set, closed form exact entries for an input-output database can be derived that maximize the overall fitness.

It is found that by the 26th generation in the computational studies of Tables 2 and 3 the GP finds a member of DLS. After the 26th generation the GP's elite solutions remain within DLS each generation. The GP spends the rest of the generations, until it converges, re-labeling the underlying elite graphs, eventually evolving $DL^T$.

By selectively eliminating rules from the rule set, of which Table 1 is a subset, or eliminating rows from the derived IO database, the effect of uncertainty can be studied. The elimination of rules represents an approach to determining the effect of linguistic imprecision, i.e., the inability of experts to provide crisp rules. The random loss of a row or rows from the IO database provides a model of the effect of uncertainty born of randomness during measurement.

The implications for the two kinds of uncertainty can be significantly different. Loss of a rule or rules can greatly expand the set of DL maps that will maximize the rule score. If all the rules are maintained, but rows are lost from the IO database, then the ultimate solution can be quite different from truth, but the underlying graph will still be isomorphic to $DL^T$. In many instances the real effect of loss of rows from the IO database is to interchange thresholds on the resulting DL map. When this occurs input signals can still be designed that will produce a desirable output. The resulting signal will be mathematically similar to the true DL, but more signal power will be required. The effect of certain kinds of uncertainty is therefore the requirement for more power, i.e., the DL map has an uncertainty insensitivity (UI) up to power.


Table 2: Evolution of the DL depicted in Figure 1

Generation | Best chromosome found in the population for the indicated generation

1 | OR2 H2 DIFF SUM_SIG123 MAX_SIG123 OR2 DIFF MAX_SIG2 H1 MAX_SIG2 SUM_SIG3

25 | OR2 AND3DELAY OR3 H3 MAX_SIG2 H2 SUM_SIG123 H2 DIFF MAX_SIG2 MAX_SIG123 AND3DELAY OR3 H2 SUM_SIG2 SUM_SIG3 H2 DIFF MAX_SIG2 MAX_SIG123

40 | OR2 OR3DELAY AND3 H1 MAX_SIG2 H1 MAX_SIG123 H2 DIFF SUM_SIG2 SUM_SIG123 AND3DELAY AND3 H3 SUM_SIG3 H1 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG3

50 | OR2 OR3DELAY AND3 H1 MAX_SIG2 H1 MAX_SIG123 H2 DIFF SUM_SIG2 SUM_SIG123 OR3DELAY AND3 H1 MAX_SIG3 H1 MAX_SIG123 H2 DIFF SUM_SIG3 SUM_SIG123

76 | OR2 OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG2 H2 DIFF SUM_SIG2 SUM_SIG123 OR3DELAY AND3 H1 MAX_SIG123 H3 SUM_SIG3 H2 DIFF SUM_SIG3 SUM_SIG123

Table 3: Evolution of the DL given in (21).

Generation | Best chromosome found in the population for the indicated generation

1 | OR2 SUM_SIG2 AND3DELAY DIFF SUM_SIG2 SUM_SIG123

8 | OR2 AND3DELAY DIFF SUM_SIG3 SUM_SIG123 AND3DELAY H2 DIFF MAX_SIG2 MAX_SIG123

16 | OR2 AND3DELAY OR3 H2 SUM_SIG2 SUM_SIG3 H2 DIFF MAX_SIG2 MAX_SIG123 OR3DELAY H1 OR3 H2 SUM_SIG2 SUM_SIG3 H2 DIFF MAX_SIG2 MAX_SIG123

30 | OR2 OR3DELAY AND3 H2 MAX_SIG123 H3 MAX_SIG2 H2 DIFF MAX_SIG2 MAX_SIG123 AND3DELAY AND3 H3 SUM_SIG2 H1 SUM_SIG123 H2 DIFF SUM_SIG2 SUM_SIG3

46 | OR2 OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG2 H2 DIFF MAX_SIG2 MAX_SIG123 OR3DELAY AND3 H1 SUM_SIG123 H3 MAX_SIG3 H2 DIFF MAX_SIG3 MAX_SIG123

6  SUMMARY  AND  CONCLUSIONS 

Genetic  program  (GP)  based  data  mining  has 
proven  effective  for  reverse  engineering  the  complex 
digital  logic  underlying  sensor  devices  (SDs)  when 
the  original  design  specifications  for  these  devices 
are  unavailable  and  invasive  study  of  the  systems  is 
impossible. 

The database that was subjected to data mining consisted of known input to the digital logic (DL), the associated measured output and a set of rules provided by experts relating to their assumptions about the digital logic. It is found that having a set of expert rules in the database is essential; the measured output of the digital logic is rarely sufficient to uniquely reverse engineer the design.

Experimental observation and theoretical analysis of the effects of uncertainty show that even when there is a significant reduction in the quality of input-output measurement information, the DL map evolved by the GP will still carry enough information for the design of signals with specific properties. The creation of these signals is considered of greater importance than having the exact DL design for the SD. The signals frequently relate to the determination of control rules for platform or multiplatform automation.